Structured document indexing and searching

ABSTRACT

Searching for data contained in a structured data structure. A method includes receiving a query. The query includes a structured data structure path and a first element related to the structured data structure path. One or more patterns are created comprising at least a portion of the structured data structure path and one or more elements related to the first element. For each of the one or more patterns, a hash is created. The created hashes are looked up in a hash index to identify one or more structured data structures correlated to the hashes. The one or more structured data structures are identified to a user.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 15/607,058 filed on May 26, 2017, titled “Structured Document Indexing and Searching,” which claims priority to U.S. Provisional Application 62/342,072, titled “Structured Document Indexing and Searching,” filed on May 26, 2016, the entirety of each of which is incorporated herein by reference.

BACKGROUND

Background and Relevant Art

Computers and computing systems have affected nearly every aspect of modern living. Computers are generally involved in work, recreation, healthcare, transportation, entertainment, household management, etc.

Computing systems are particularly useful for the creation and manipulation of data. For example, data can be stored in databases. Many computer systems store data in structured data structures. Such structured data structures are hierarchical data structures used for storing and organizing data. One example of such structured data structures includes xml data structures.

Searching and retrieving data from xml data structures can be difficult due to the nature of how xml data structures are accessed. In particular, searching an xml data structure may require a large amount of data parsing and data processing.

The subject matter claimed herein is not limited to embodiments that solve any disadvantages or that operate only in environments such as those described above. Rather, this background is only provided to illustrate one exemplary technology area where some embodiments described herein may be practiced.

BRIEF SUMMARY

One embodiment illustrated herein is a method that may be practiced in a data storage environment. The method includes acts for searching for data contained in a structured data structure. The method includes receiving a query. The query includes a structured data structure path and a first element related to the structured data structure path. One or more patterns are created comprising at least a portion of the structured data structure path and one or more elements related to the first element. For each of the one or more patterns, a hash is created. The created hashes are looked up in a hash index to identify one or more structured data structures correlated to the hashes. The one or more structured data structures are identified to a user.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

Additional features and advantages will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the teachings herein. Features and advantages of the invention may be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. Features of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth hereinafter.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to describe the manner in which the above-recited and other advantages and features can be obtained, a more particular description of the subject matter briefly described above will be rendered by reference to specific embodiments which are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments and are not therefore to be considered to be limiting in scope, embodiments will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:

FIG. 1 illustrates a data indexing system;

FIG. 2 illustrates a data retrieval system;

FIG. 3 illustrates a structured data structure;

FIG. 4 illustrates a method of indexing data; and

FIG. 5 illustrates a method of retrieving data.

DETAILED DESCRIPTION

Embodiments illustrated herein can use a specialized index to make structured data structure searches much more efficient. In particular, embodiments can use a compact index that indexes structured data paths combined with data in a structured data path, such that embodiments can quickly and efficiently identify particular structured data structures of interest. An improved computing system can thus be implemented by implementing embodiments of the invention illustrated herein. This can be accomplished by implementing a data indexing system capable of creating the compact index and a data retrieval system capable of greatly improved functionality over other systems, in that the data retrieval system can access data faster and using fewer computing resources than previous systems. Thus, a more efficient system can be implemented using the functionality illustrated herein.

Referring now to FIG. 1, an example is illustrated. FIG. 1 illustrates a data indexing system 100 which accesses a set of structured data structures 102. These data structures 102 may be internal to the data indexing system 100 as shown in FIG. 1, or external to the data indexing system 100. The structured data structures 102 contain data formatted in a structured data format, such as is common in xml documents.

FIG. 1 illustrates that the structured data structures 102 are provided to a pattern generator 104. The pattern generator 104 will generate various patterns using structured data paths and data from the structured data structures 102. Various examples of this will be illustrated below. However, for clarity, one example may be an xml data path which can be followed down to a data element. One pattern that may be generated includes the xml data path along with the particular data element. However, other patterns may alternatively or additionally be generated, such as the data path along with a synonym of the particular data element, an alternative spelling of the particular data element, or another element related to the particular data element that can be navigated to using the data path. In some embodiments, the pattern generator 104 will generate multiple patterns for a given data path/element combination. This will be illustrated in more detail below.

FIG. 1 further illustrates a hasher 106. The hasher 106 is configured to generate a hash for each of the patterns generated by the pattern generator 104.

FIG. 1 further illustrates an indexer 108. The indexer 108 is configured to index the hashes generated by the hasher 106 with the appropriate structured data structures 102. Thus, for example, if the pattern generator 104 generates a pattern from a particular structured data structure, and the hasher 106 computes a hash of that pattern, then the indexer 108 will index the computed hash correlated with the particular structured data structure in the hash index 110.

Thus, this generally results in several index entries in the hash index 110 for each data path/element combination.
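The flow from the pattern generator 104 through the hasher 106 to the indexer 108 can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function names, the "path:element" pattern layout, the document identifier "doc-1", the alternative spelling "louis", and the use of BLAKE2b as a 128-bit hash are all hypothetical and are not taken from the described embodiments.

```python
import hashlib
from collections import defaultdict

def make_patterns(path, element, related=()):
    """Pattern generator: pair a data path with an element and any related elements."""
    # The "path:element" layout is an assumption made for illustration.
    return [f"{path}:{e}" for e in (element, *related)]

def hash_pattern(pattern):
    """Hasher: 128-bit digest of one pattern (BLAKE2b is an assumed stand-in)."""
    return hashlib.blake2b(pattern.encode("utf-8"), digest_size=16).hexdigest()

def index_document(hash_index, doc_id, path, element, related=()):
    """Indexer: correlate each pattern hash with the document it came from."""
    for pattern in make_patterns(path, element, related):
        hash_index[hash_pattern(pattern)].add(doc_id)

hash_index = defaultdict(set)
# One path/element combination plus one alternative spelling gives two entries.
index_document(hash_index, "doc-1", "/Patient/name/family", "lewis", related=("louis",))
```

Each related element contributes its own hash, which is why a single data path/element combination typically produces several index entries.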

A hash index 110 can be used to quickly and efficiently identify structured data structures from among the set of structured data structures 102 that contain a particular element in a particular data path in a structured data structure. Note that as used herein, an element may be a traditional data element occurring in the structured data paths. Alternatively or additionally, an element may be a portion of data, and could potentially be unstructured data. For example, consider a case where a data path navigates to a transcription tag in an xml document. The transcription tag element may be a large amount of unstructured transcribed data. However, as used herein, an element may include one or more words from the transcription in the transcription tag. Thus, in this particular example, the pattern may include the path to the transcription tag along with one or more words contained in the transcription tag even though the words are included as part of the unstructured text in the transcription tag.

Referring now to FIG. 2, an example of using the hash index 110 to quickly retrieve structured data structures that include certain paths and elements is illustrated. In particular, the data retrieval system 200 receives a query 202 at a front-end 203 of the retrieval system 200. The front-end 203 passes the query to a pattern generator 204. In some embodiments, the pattern generator 204 can be eliminated and the query 202 can be passed directly to the hasher 206. For example, the query 202 may include the data path and one or more elements. This data path and element(s) can be hashed by the hasher 206. This hash is passed to the lookup tool 208. The lookup tool 208 can attempt to locate the hash in the hash index 110. As noted previously, the hash index 110 indexes hashes of generated patterns to structured data structures such as those included in the structured data structures 102. If the lookup tool 208 is able to match a hash of the query 202 with an entry in the hash index 110, the lookup tool can return to the front-end 203 the results from the hash index 110 identifying particular structured data structures that include data that matches the query 202. The front-end 203 can then return the results to a user 201, for example, at a machine used by the user 201 to generate and send the query 202.
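The lookup path of FIG. 2 can be sketched in a similar way. This is a minimal sketch, assuming the same hypothetical "path:element" pattern layout and BLAKE2b stand-in hash on both the indexing and the query side; the index contents and document identifiers below are invented for illustration.

```python
import hashlib

def hash_pattern(pattern):
    """Hasher: must match the hashing used at indexing time (assumed BLAKE2b, 128-bit)."""
    return hashlib.blake2b(pattern.encode("utf-8"), digest_size=16).hexdigest()

def lookup(hash_index, path, elements):
    """Lookup tool: return ids of documents matching any query pattern."""
    matches = set()
    for element in elements:
        matches |= hash_index.get(hash_pattern(f"{path}:{element}"), set())
    return matches

# A tiny hand-built index standing in for the hash index 110.
hash_index = {hash_pattern("/Patient/name/family:lewis"): {"doc-1", "doc-7"}}
found = lookup(hash_index, "/Patient/name/family", ["lewis"])
missing = lookup(hash_index, "/Patient/name/family", ["smith"])
```

Because the query is reduced to hashes before the index is consulted, no structured data structure needs to be parsed at query time in this sketch.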

The following now illustrates additional details. Note that while XML is shown, it should be appreciated that other structured data could be used in a similar fashion.

As document and database information has become integrated through structured data structures, such as the Extensible Markup Language (XML) document format, an important frontier has been how to efficiently search collections of such flexible, information-rich documents. Enhancements are described herein that can enable rapid and scalable searching using XPath or a similar query language, searches that typically are implemented via slow scanned searching of individual structures, such as individual XML documents.

Instead of spreading information across database records in multiple tables, structured and unstructured information is being integrated into documents. These documents include hierarchical XML tagged text, variables and values, plus structured information about that data. Various tools, including natural language processing and expert systems, can enhance the value of the structured/unstructured information mixture, making the XML/text combination less ambiguous, more information rich and more valuable to search.

Embodiments herein may include enhancements to the indexing and searching processes illustrated in U.S. Pat. No. 8,745,035 titled “Multistage Pipeline For Feeding Joined Tables To A Search System”, U.S. Pat. No. 8,392,426 titled “Indexing And Filtering Using Composite Data Stores”, U.S. Pat. No. 8,266,152 titled “Hashed Indexing”, U.S. Pat. No. 8,190,597 titled “Multistage Pipeline For Feeding Joined Tables To A Search System”, U.S. Pat. No. 8,176,052 titled “Hyperspace Index”, U.S. Pat. No. 8,032,495 titled “Index Compression”, U.S. Pat. No. 7,912,840 titled “Indexing And Filtering Using Composite Data Stores”, U.S. Pat. No. 7,774,353 titled “Search Templates”, U.S. Pat. No. 7,774,347 titled “Vortex Searching”, and U.S. Pat. No. 7,644,082 titled “Abbreviated Index”, which are incorporated herein by reference in their entireties, controlled by a very flexible parse table control structure to enhance XML searching. These index entries with several collections of search acceleration keys enable rapid XPath or other XML searching queries that typically require scan searching of databases full of XML documents.

Adding Patterns to Text and Variable Information to Include XML Tags with their Domains

By default, index keys are created for text and variable information in a document. For example, consider the sample patient record below, which highlights patient identifiers used for multiple patient identifier (MPI) information records within an XML document.

This is an example of an XML template structure that would be used to generate a standardized set of documents using a limited set of XML tags and hierarchies. Such a system could have variations and different versions indicating updates and improvements to the XML template definitions in ongoing use and development or events like merging divisions or new companies into a single unified template within an organization.

Here is an example of a first section for patient name information.

  <Patient xmlns="http://h17.org/fhir">
   <id value="1.MPID" />
   <contained>
    <Patient>
     <id value="2222.NPI" />
     <name>
      <family value="lewis"/>
      <given value="michael"/>
      <given value="iván"/>
      <suffix value="jr" />
     </name>
     ...
    </Patient>

Independent of the XML context, in some embodiments, the following patterns are created and added to the hash index 110 when processing the word “lewis,” zeroing in on the text itself in the document:

  Keyword: “lewis”
  Exact Phrase: “lewis michael”
  Exact Phrase: “lewis michael ivan”
  Content Phrase: “lewis michael”
  Content Phrase: “ivan lewis michael”
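The keyword and exact phrase patterns listed above can be sketched as follows. This is a hedged sketch: the function name, the tuple layout, and the three-word phrase limit are assumptions, and the content phrase patterns are deliberately omitted because their construction rule is not spelled out here.

```python
def text_patterns(words, start, max_len=3):
    """Keyword and exact-phrase patterns that begin at words[start]."""
    patterns = [("Keyword", words[start])]
    for n in range(2, max_len + 1):
        phrase = words[start:start + n]
        if len(phrase) == n:  # skip phrases that would run past the last word
            patterns.append(("Exact Phrase", " ".join(phrase)))
    return patterns

# Words as they appear in the <name> element of the sample record,
# with the accent on "iván" already normalized (an assumption).
words = ["lewis", "michael", "ivan", "jr"]
lewis_patterns = text_patterns(words, 0)
jr_patterns = text_patterns(words, 3)
```

Calling the function at each word position in turn would give the analogous pattern sets for the remaining words.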

Following up the XML hierarchy, many patterns can be created for the hash index.

When the “family” xml tag is processed at the end of the tag (“/>”), the following pattern could be created. Phonetic and other related text patterns could also be created.

  Keyword: “lewis” xml:“family”

Or:

  Keyword: “family:lewis”

When the name xml tag is processed at the </name> end tag, the following patterns can be created, again at the word “lewis,” in this first section of the <patient> tag. This XML tag is one level up from “family,” and its data could be placed in one long string for the index access. The XML index entries can also be binary strings because of the more exact requirements that can be placed on the format of the text and data within the XML context. The XML tags that are to be made into these complete string index entries would be declared in the parse table, which can include wildcarding of the XML tags.

  Keyword: “lewis” xml:“name”
  Exact Phrase: “lewis michael” xml:“name”
  Exact Phrase: “lewis michael ivan” xml:“name”
  Content Phrase: “lewis michael” xml:“name”
  Content Phrase: “ivan lewis michael” xml:“name”
  Binary with underscore separator lowercase: “lewis_michael_ivan_jr”

Other options for binary patterns:

  - Binary with space separator lowercase: “lewis michael ivan jr”
  - Binary with space separator originalcase with “,” separator after <family>: “Lewis, Michael Ivan Jr”

Similar sets of patterns can be created starting at “michael,” “ivan” and “jr.” The binary pattern would apply for the data within the domain of the tag “name” and only one pattern would be created, unlike the exact phrase patterns that move across the four words of data. These binary patterns would be very useful in checking for various versions of the name associated with the particular patient record.
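The tag-scoped patterns and the single binary pattern per tag domain can be sketched as follows. The helper names and the tuple layout are hypothetical; the separators and casing follow the options listed above.

```python
def tag_scoped(patterns, tag):
    """Attach an enclosing XML tag to each (kind, text) pattern."""
    return [(kind, text, tag) for kind, text in patterns]

def binary_pattern(values, separator="_"):
    """One lowercase pattern covering all data within a tag's domain."""
    return separator.join(v.lower() for v in values)

def display_pattern(family, rest):
    """Original-case variant with a comma after the family name."""
    return f"{family}, {' '.join(rest)}"

scoped = tag_scoped([("Keyword", "lewis"), ("Exact Phrase", "lewis michael")], "name")
underscore = binary_pattern(["Lewis", "Michael", "Ivan", "Jr"])
spaced = binary_pattern(["Lewis", "Michael", "Ivan", "Jr"], separator=" ")
display = display_pattern("Lewis", ["Michael", "Ivan", "Jr"])
```

Unlike the phrase patterns, which are generated once per starting word, the binary pattern in this sketch is produced once for the whole <name> domain.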

As embodiments continue processing the XML hierarchy, sets of patterns will be created with the <Patient> tag, and the <contained> tag as the ending XML tag is hit covering the information within the domain of the tag.

The next <Patient> record with <id value="2222.PI"/> is now processed with several <identifier> tags at the sibling level.

  <Patient><id value="2222.PI"/>
  <identifier>
   <type>
    <coding>
     <system value="http://h17.org/fhir/v2/0203"/>
     <code value="SS"/>
     <display value="Social Security number"/>
    </coding>
   </type>
   <system value="http://h17.org/fhir/sid/us-ssn"/>
   <value value="000000001"/>
  </identifier>

In this XML for master patient identifier systems, the details of the Master Patient Identifier use this particular template of XML with <identifier> tags giving more information about the ID in the original <patient> record without the subtag <identifier>.

Planned Key Explosion on Data and Text Side

The index 110 may already be under expansion pressure independent of the XML paths within which the data is contained. Indexing of grammatical forms, longer exact and content word phrases, capitalization retention patterns, and misspelling autocorrection, plus other natural language and other processing that can improve the semantic accuracy of the search, all contribute to the expansion of the index 110.

The number of patterns may be multiplied several times over as a result of adding additional patterns, including XML path additions and combinations. For example, a typical fully indexed system today has a 100% overhead for the index. That is, the index is the same size as the actual data being indexed. Assume extra patterns may increase that to 300% overhead. Then, if all paths and combinations of XML pieces are used with a complex XML hierarchy structure, there might be a 20× or even 50× or 100× increase in index size, giving as high as 10,000% index overhead. Thus, some embodiments may implement, at the pattern generator 104, careful use of XML pattern selection and creation criteria, which might result in, for example, 1,500% index overhead. In this case, there would be 15 TB of index for 1 TB of data, which would be a reasonable index size that could be implemented.
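The overhead figures above reduce to simple arithmetic, worked through in the short sketch below. The function name is hypothetical, and the 1 TB data size is the example value used above.

```python
def index_size_tb(data_tb, overhead_percent):
    """Index size implied by a given overhead percentage of the data size."""
    return data_tb * overhead_percent / 100

baseline = index_size_tb(1, 100)       # index equal in size to the data
selective = index_size_tb(1, 1_500)    # careful XML pattern selection
unlimited = index_size_tb(1, 100 * 100)  # 100x increase over the 100% baseline
```

Under these numbers, selective pattern creation keeps the index at 15 TB per 1 TB of data, compared with 100 TB in the unlimited case.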

This underlines the importance of careful XML path selection criteria and processing, which give good XML context information in the index for XPath searching without overly expanding the index size, as would happen if all combinations of all XML path lengths and contexts of the path information were included in the index. The expert system and the assembly and subassembly indexing examples would produce an even worse index size in the unlimited case.

Kinds of XML Path Subsets that Guide Indexing and Searching

Even though implemented embodiments of the index 110 can handle billions and even trillions of unique non-duplicate 128-bit hashed keys, if all combinations of XML path keys are indexed (e.g., complete paths, partial paths, ordered contiguous paths, differing versions and orderings of XML tag templates, etc.), the resulting explosion of keys for the index 110 may overload the indexing capabilities. Thus, embodiments selectively index various subsets of the complete XML path that limit the explosion while still avoiding complete scan searches and providing keys that make XPath searching seem nearly instantaneous.

To limit the proliferation of hashed keys for the index 110, embodiments can categorize various subsets of XML paths in a template that are worth indexing, which can yield useful search results while limiting the creation of duplicate or less useful keys.

Here are some of the possible XML path subsets. Which subsets to include would be specified in the parse table structure.

  1. Full path: /Patient/contained/Patient/name/family
  2. N-XML adjacent tags in subpath: contained/Patient, type/coding/code
  3. Discontinuous path elements: contained/*/name
  4. Lowest level XML tag: family
  5. XML that is ignored by itself, or is invisible: contained
  6. XML that is invisible and also breaks exact phrase and other patterns:
     Example: Paragraph: <P> </P>
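Several of the path subsets listed above can be derived mechanically from a full path, as the following sketch suggests. The function name, the dictionary keys, and the choice of a single "*" wildcard spanning exactly one level are assumptions made for illustration; an actual parse table could select subsets quite differently.

```python
def path_subsets(full_path, n=2, invisible=frozenset()):
    """Derive candidate path subsets from one full XML path."""
    tags = full_path.strip("/").split("/")
    return {
        "full_path": "/" + "/".join(tags),
        # Every run of n adjacent tags in the path.
        "adjacent": ["/".join(tags[i:i + n]) for i in range(len(tags) - n + 1)],
        # Two tags separated by one wildcarded level.
        "discontinuous": [f"{tags[i]}/*/{tags[i + 2]}" for i in range(len(tags) - 2)],
        "lowest": tags[-1],
        # Tags that remain once invisible tags are dropped.
        "visible": [t for t in tags if t not in invisible],
    }

subsets = path_subsets("/Patient/contained/Patient/name/family",
                       invisible=frozenset({"contained"}))
```

Only the subsets actually selected would be combined with data elements and hashed, which is what keeps the number of keys bounded.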

Example of Invisible Tag

A good example of an XML tag that could be excluded and not be used to make any additional sets of patterns is the <contained> tag for this template set for the multiple ID XML specification. <contained> includes the whole record and the individual information within its domains will have already been made into the basic data patterns plus additional patterns for the next levels of XML tags. This XML tag is typical of a container tag that in turn has several subtags indicating one or more subrecords in the container tag, in this case the <patient> tag that may occur multiple times. Invisible tags can be detected in many instances if the data part of the pattern is identical for a tag and its parent tag (e.g. type/coding/code above).

Example of Invisible Plus Exact Phrase Break

The paragraph tag is a good example of an invisible tag that limits the overlap of phrase words between sections of text. This avoids generating phrases that are not really indicated in the text or that do not make sense as a phrase. Embodiments may already generate text patterns for the text within the paragraph, so the tag is invisible in the sense of not creating patterns specific to the paragraph tag.

Here is an example of two paragraphs and the exact phrase patterns that are generated at the boundary between the two paragraphs.

<P> The Lord is my shepherd; I shall not want. </P>

<P> He maketh me to lie down in green pastures: he leadeth me beside the still waters. </P>

Exact Phrase and keyword Patterns at boundary:

“I shall not” “shall not want” “not want” “want”

“he maketh me” “maketh me to”
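The phrase break at the paragraph boundary can be sketched as follows. This is a minimal sketch, assuming a three-word phrase limit, lowercased words, and a simple letters-only tokenizer; none of these choices are specified by the embodiments.

```python
import re

def boundary_safe_phrases(paragraphs, max_len=3):
    """Phrases of up to max_len words that never cross a paragraph boundary."""
    phrases = []
    for text in paragraphs:
        words = re.findall(r"[a-z]+", text.lower())
        for i in range(len(words)):
            # Slicing stops at the end of the paragraph, so phrases shrink there.
            phrases.append(" ".join(words[i:i + max_len]))
    return phrases

paragraphs = [
    "The Lord is my shepherd; I shall not want.",
    "He maketh me to lie down in green pastures: "
    "he leadeth me beside the still waters.",
]
phrases = boundary_safe_phrases(paragraphs)
```

Because each paragraph is processed on its own, a phrase such as “not want he” is never generated.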

Abbreviated Path Information

The full path designation may become very long, making it more difficult to specify a valid search. Limiting the length of the path elements can simplify the final keys. This may be particularly useful when verbose or dictionary kinds of definitional language for the XML tags are being used rather than shorter, more abbreviated forms.

For example, consider XML that contains parts and their relationships within a subassembly of a Boeing Jetliner.

Suppose a user wishes to search specifically for a fairly common bolt used inside the auxiliary power unit in the tail of most modern jets. An abbreviated version of the path might be /B737/fuselage/tail_section/APU/controls/bolts/bolt_635. Expanding the APU fragment of the path to auxiliary_power_unit, as well as expanding other tags, could produce quite a long string. Using a limiting length of 12 letters per level could replace auxiliary_power_unit with auxiliary_po*, still assisting the search while being less verbose.
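The 12-letter limit per level can be sketched as follows. The function name and the convention of appending “*” only to truncated levels are assumptions made for illustration.

```python
def abbreviate_path(path, limit=12):
    """Truncate each path level to `limit` characters, marking truncation with '*'."""
    parts = []
    for tag in path.strip("/").split("/"):
        parts.append(tag if len(tag) <= limit else tag[:limit] + "*")
    return "/" + "/".join(parts)

short = abbreviate_path(
    "/B737/fuselage/tail_section/auxiliary_power_unit/controls/bolts/bolt_635"
)
```

Levels that already fit within the limit, such as tail_section, pass through unchanged, so only the verbose tags are shortened.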

An example of criteria for selecting subsets of the path through the part and assembly explosion is the following modified path: B737/fuselage/tail_section/APU/*/bolt_635. The main classifications, plus the subassembly entry point and terminal nodes like the bolt, are XML levels that might be sufficient for indexing and searching without exploding the size of the index too much.

Note that in the part and assembly application of XML paths in this example, the large number of possible path and subpath patterns becomes very evident. Thus, it may be advantageous to implement functionality to limit the number of patterns.

Efficient Search for a Variety of Hierarchical Structures

The mixture of hierarchies and text indexed efficiently with a system such as one that includes a hash index 110 can integrate structured, semantically disambiguated data with the advantages of full text with all of its ambiguity and nuances.

Not only could this system work for engineering hierarchies for airplanes, but even the complex hierarchies of expert system logic can also be searched efficiently. FIG. 3 illustrates a decision tree 300 used for advanced logic for manufacturing process planning. The following illustrates how text and logic could also be indexed and searched.

For example, “Basic Shape” operations containing a “turn lathe 2” processing operation could be searched using hierarchies illustrated in FIG. 3 and indexed as described previously herein. Embodiments could be used to index defined subhierarchies and subtrees of expert system logic that would involve many subtrees and hundreds of levels of decision tree logic that in turn could link to other hierarchies of logic. Once again, it may be useful, depending on the hierarchy and the resources available for indexing, to limit the hierarchy+data index from handling every path and every combination of tags together within that path.

For example, embodiments may only index patterns for manual nodes, or only create index patterns for automatic nodes. Alternatively or additionally, embodiments may only create index patterns for the top nodes in hierarchical subtrees and/or the terminal nodes.
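Node selection of this kind can be sketched as a filter applied while walking a tree. The tree below is hypothetical and does not reproduce FIG. 3: the Node class, the "manual" and "automatic" labels as a field, and the "inspect" node are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str = "automatic"  # assumed labels: "manual" or "automatic"
    children: list = field(default_factory=list)

def indexable_paths(node, keep, prefix=""):
    """Collect paths of the nodes for which keep(node) is true."""
    path = f"{prefix}/{node.name}"
    paths = [path] if keep(node) else []
    for child in node.children:
        paths.extend(indexable_paths(child, keep, path))
    return paths

tree = Node("Basic Shape", "manual", [
    Node("turn lathe 2", "automatic"),
    Node("inspect", "manual"),
])
terminal_only = indexable_paths(tree, lambda n: not n.children)
manual_only = indexable_paths(tree, lambda n: n.kind == "manual")
```

Swapping the predicate changes which nodes receive index patterns without changing the traversal itself.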

The following discussion now refers to a number of methods and method acts that may be performed. Although the method acts may be discussed in a certain order or illustrated in a flow chart as occurring in a particular order, no particular ordering is required unless specifically stated, or required because an act is dependent on another act being completed prior to the act being performed.

Referring now to FIG. 4, a method 400 is illustrated. In the illustrated example, the method 400 may be practiced in a data storage environment. The method includes acts for indexing data contained in a structured data structure.

The method 400 includes identifying a structured data structure path in a structured data structure (act 402).

The method 400 further includes identifying a first element related to the structured data structure path (act 404).

The method 400 further includes creating one or more patterns comprising at least a portion of the structured data structure path and one or more elements related to the first element (act 406).

The method 400 further includes for each of the one or more patterns, creating a hash (act 408).

The method 400 further includes indexing created hashes in a hash index by correlating the hashes in the hash index with the structured data structure (act 410).

The method 400 may be practiced where the structured data structure comprises an XML document.

The method 400 may be practiced where the structured data structure comprises a decision tree.

The method 400 may be practiced where the structured data structure is a JSON document.

The method 400 may be practiced where the structured data structure is an XML document.

The method 400 may be practiced where the first element comprises one or more structured data elements.

The method 400 may be practiced where the first element comprises one or more unstructured data elements (e.g., words from free form text) contained in the structured data structure.
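Acts 402 through 410 can be sketched for a JSON document as follows. This is an illustrative sketch only: the sample document, the "doc-1" identifier, the "path:element" pattern layout, and the BLAKE2b stand-in hash are assumptions rather than details of the method 400.

```python
import hashlib
import json
from collections import defaultdict

def walk(value, path=""):
    """Acts 402 and 404: yield (path, element) pairs from parsed JSON."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from walk(child, f"{path}/{key}")
    elif isinstance(value, list):
        # List items share the parent path, like repeated XML tags.
        for child in value:
            yield from walk(child, path)
    else:
        yield path, str(value)

def hash_pattern(pattern):
    """Act 408: 128-bit hash of a pattern (assumed BLAKE2b stand-in)."""
    return hashlib.blake2b(pattern.encode("utf-8"), digest_size=16).hexdigest()

doc = json.loads(
    '{"Patient": {"name": {"family": "lewis", "given": ["michael", "ivan"]}}}'
)
pairs = list(walk(doc))

hash_index = defaultdict(set)
for path, element in pairs:
    pattern = f"{path}:{element}"                   # act 406
    hash_index[hash_pattern(pattern)].add("doc-1")  # act 410
```

The same traversal idea would apply to an XML document or a decision tree, with only the walking function changing.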

Referring now to FIG. 5 a method 500 is illustrated. The method 500 may be practiced in a data storage environment. The method 500 includes acts for searching for data contained in a structured data structure.

The method 500 includes receiving a query, wherein the query comprises a structured data structure path and a first element related to the structured data structure path (act 502).

The method 500 further includes creating one or more patterns comprising at least a portion of the structured data structure path and one or more elements related to the first element (act 504).

The method 500 further includes, for each of the one or more patterns, creating a hash (act 506).

The method 500 further includes looking up the created hashes in a hash index to identify one or more structured data structures correlated to the hashes (act 508).

The method 500 further includes identifying to a user the one or more structured data structures (act 510).

The method 500 may further include providing a confidence level indicating a confidence that the one or more structured data structures match the query. For example, the confidence level is based on how similar the one or more related elements are to the first element.
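One way such a confidence level might be computed is sketched below. The similarity measure is an assumption: Python's difflib.SequenceMatcher ratio is used purely as a stand-in, and the embodiments do not specify any particular measure.

```python
from difflib import SequenceMatcher

def confidence(first_element, matched_element):
    """Confidence in [0, 1] based on string similarity (assumed measure)."""
    return SequenceMatcher(None, first_element, matched_element).ratio()

exact = confidence("lewis", "lewis")    # query element matched as typed
variant = confidence("lewis", "louis")  # matched via an alternative spelling
```

In this sketch, a structure found through the element exactly as queried receives full confidence, while one found only through a related element receives a lower value.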

Embodiments may be practiced by a computer system including one or more processors and computer-readable media such as computer memory. In particular, the computer memory may store computer-executable instructions that when executed by one or more processors cause various functions to be performed, such as the acts recited in the embodiments.

Embodiments of the present invention may comprise or utilize a special purpose or general-purpose computer including computer hardware, as discussed in greater detail below. Embodiments within the scope of the present invention also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system. Computer-readable media that store computer-executable instructions are physical storage media. Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example, and not limitation, embodiments of the invention can comprise at least two distinctly different kinds of computer-readable media: physical computer-readable storage media and transmission computer-readable media.

Physical computer-readable storage media includes RAM, ROM, EEPROM, CD-ROM or other optical disk storage (such as CDs, DVDs, etc), magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.

A “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. Transmission media can include a network and/or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above are also included within the scope of computer-readable media.

Further, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission computer-readable media to physical computer-readable storage media (or vice versa). For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a “NIC”), and then eventually transferred to computer system RAM and/or to less volatile computer-readable physical storage media at a computer system. Thus, computer-readable physical storage media can be included in computer system components that also (or even primarily) utilize transmission media.

Computer-executable instructions comprise, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. The computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the described features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.

Those skilled in the art will appreciate that the invention may be practiced in network computing environments with many types of computer system configurations, including, personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, pagers, routers, switches, and the like. The invention may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.

Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.

The present invention may be embodied in other specific forms without departing from its spirit or characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope. 

1. A system for indexing data contained in a structured data structure, the system comprising: at least one processor; and at least one computer readable medium coupled to the processor comprising computer executable instructions that when executed by the processor implement: a pattern generator, wherein the pattern generator is configured to: identify a structured data structure path in a structured data structure comprising a plurality of records, each of the records comprising data values, wherein a particular record can be reached by following the structured data structure path; identify a first data value from the record; create one or more patterns comprising at least a portion of the structured data structure path combined with one or more elements related to the first data value such that at least one of the patterns comprises the structured data structure path and the first data value; a hasher configured to, for each of the one or more patterns, including at least one pattern that includes the first data value and at least a portion of the structured data structure path, create a hash; and an indexer configured to index created hashes in a hash index by correlating the hashes in the hash index with the structured data structure, including indexing the hash created for the at least one pattern comprising both the structured data structure path and the first data value.
 2. The system of claim 1, wherein the structured data structure comprises an XML document.
 3. The system of claim 1, wherein the structured data structure comprises a decision tree.
 4. The system of claim 1, wherein the structured data structure is a JSON document.
 5. The system of claim 1, wherein the structured data structure is an XML document.
 6. The system of claim 1, wherein the first data value is included in a structured data element.
 7. The system of claim 1, wherein the first data value is included in an unstructured data element contained in the structured data structure.
 8. In a data storage environment, a method of indexing data contained in a structured data structure, the method comprising: identifying a structured data structure path in a structured data structure comprising a plurality of records, each of the records comprising data values, wherein a particular record can be reached by following the structured data structure path; identifying a first data value from the record; creating one or more patterns comprising at least a portion of the structured data structure path combined with one or more elements related to the first data value such that at least one of the patterns comprises the structured data structure path and the first data value; for each of the one or more patterns, including at least one pattern that includes the first data value and at least a portion of the structured data structure path, creating a hash; and indexing created hashes in a hash index by correlating the hashes in the hash index with the structured data structure, including indexing the hash created for the at least one pattern comprising both the structured data structure path and the first data value.
 9. The method of claim 8, wherein the structured data structure comprises an XML document.
 10. The method of claim 8, wherein the structured data structure comprises a decision tree.
 11. The method of claim 8, wherein the structured data structure is a JSON document.
 12. The method of claim 8, wherein the structured data structure is an XML document.
 13. The method of claim 8, wherein the first data value is included in a structured data element.
 14. The method of claim 8, wherein the first data value is included in an unstructured data element contained in the structured data structure.
 15. The method of claim 8, wherein creating one or more patterns comprises excluding one or more patterns that include a container tag that indicates one or more subrecords in the container tag.
 16. In a data storage environment, a method of searching for data contained in a structured data structure, the method comprising: receiving a query, wherein the query comprises a structured data structure path and a first data value; for the at least a portion of the structured data structure path combined with the first data value, creating a hash; looking up the created hash in a hash index to identify one or more structured data structures, wherein the hash index comprises a correlation of hashes with structured data structures, including the hash for the at least a portion of the structured data structure path combined with the first data value, the hashes in the hash index being based on hashes of structured data structure paths combined with values in records of the structured data structure that are reached by following the structured data structure paths to the records; and identifying to a user the one or more structured data structures correlated to the hash for the at least a portion of the structured data structure path combined with the first data value.
 17. The method of claim 16, further comprising providing a confidence level indicating a confidence that the structured data structure matches the query.
 18. The method of claim 16, wherein the query comprises an XPath query.
 19. The method of claim 16, wherein the first data value comprises one or more structured data elements.
 20. The method of claim 16, wherein the first data value comprises one or more unstructured data elements contained in the structured data structure. 
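
By way of illustration only, and without limiting the claims, the indexing steps recited in claim 8 and the searching steps recited in claim 16 might be sketched as follows for XML documents. The function names, the "path=value" pattern encoding, the use of SHA-256, and the in-memory dictionary serving as the hash index are all assumptions made for this sketch and are not taken from the description. The sketch hashes only full paths combined with values, so patterns built from a portion of a path are not covered, and elements that carry no text of their own (container elements) produce no pattern.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import defaultdict


def pattern_hash(path, value):
    """Hash one pattern: a structure path combined with a data value.

    The "path=value" encoding and SHA-256 are illustrative assumptions.
    """
    return hashlib.sha256(f"{path}={value}".encode("utf-8")).hexdigest()


def iter_patterns(element, prefix=""):
    """Yield (path, value) for each element that directly holds text."""
    path = f"{prefix}/{element.tag}"
    text = (element.text or "").strip()
    if text:
        yield path, text
    for child in element:
        yield from iter_patterns(child, path)


def index_document(hash_index, doc_id, xml_text):
    """Correlate the hash of every path/value pattern with the document."""
    root = ET.fromstring(xml_text)
    for path, value in iter_patterns(root):
        hash_index[pattern_hash(path, value)].add(doc_id)


def search(hash_index, path, value):
    """Hash the query's path and value, then look the hash up in the index."""
    return set(hash_index.get(pattern_hash(path, value), set()))


# Hypothetical documents and identifiers, used only to exercise the sketch.
hash_index = defaultdict(set)
index_document(hash_index, "doc1",
               "<order><customer><name>Ann</name></customer></order>")
index_document(hash_index, "doc2",
               "<order><customer><name>Bob</name></customer></order>")

matches = search(hash_index, "/order/customer/name", "Ann")
print(matches)  # {'doc1'}
```

In this sketch a query is answered with a single hash computation and one dictionary lookup, so the stored documents are not parsed again at query time.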